docs(simd): TD-SIMD-8 - F16 honesty + matrix audit for missing lanes #178
Merged
# F16 honesty (TD-SIMD-8)

`src/simd_half.rs` F16x16: docstring now explicitly discloses scalar storage and routes hot loops to `core::simd::f16x16` (under `nightly-simd`) or to fp32 with conversion at boundaries. Disambiguates from `simd_avx2::F16Scaler`, which is a scaling CONTEXT for range-normalizing values before f16 encoding, not the F16x16 SIMD type. Both files cross-reference each other so a future reader doesn't repeat the confusion.

`src/simd_avx2.rs` F16Scaler: docstring strengthened with the same disambiguation note.

# Matrix audit (user request)

Cross-referenced every `pub struct *x*` in simd_avx512.rs, simd_avx2.rs, simd_neon.rs, simd_nightly/mod.rs against the parity matrix in the architecture doc. Corrections:

- **F32x8 / F64x4 v3 column: ❌ → ✅ `__m256`/`__m256d` (in `simd_avx512`)**. The dispatch at `src/simd.rs:294` already imports these from simd_avx512 on the v3 / AVX2 path. They're AVX (not AVX-512), so they work on every Sandy Bridge+ host. The matrix was stale.
- **U32x8, U64x4 rows added**: nightly-only currently; ❌ on x86 + aarch64 + scalar. core::simd has them via `simd_nightly`.
- **U16x16, I32x8, I64x4 rows added**: missing across EVERY backend including nightly. Theoretical 256-bit shapes no consumer has reached for yet.
- **F32Mask8 / F64Mask4 rows added**: declared in simd_scalar as `F32Mask8Scalar` / `F64Mask4Scalar` (rename came from a duplicate-decl conflict on i686); not surfaced through `crate::simd::*`. AVX-512 has them natively via `__mmask8` but they're not typed.
- **Sub-byte lanes section added**: I4 / U4 lanes used by INT4 quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper; consumers pack 2× nibbles per byte and operate through U8x64 + shr/mask. Documents the hardware story (AVX-512 VBMI2, VPCOMPRESSB on x86; shr+mask trick on aarch64). Tracked as TD-SIMD-11 if a consumer files for it.

TD-SIMD-8 description updated in §5 to point at `simd_half.rs:123` (the actual F16x16 polyfill) rather than `simd_avx2.rs:2566` (the unrelated F16Scaler scaling utility).
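To make the "scalar storage, fp32 with conversion at boundaries" advice concrete, here is a minimal sketch. It is not the code in `src/simd_half.rs`: the struct name `F16x16Scalar` is borrowed from the alias proposed for TD-SIMD-8, and `f16_bits_to_f32`, `to_f32_array`, and `dot_f32` are hypothetical helpers invented for illustration. It assumes lanes are stored as raw IEEE 754 binary16 bit patterns in a `[u16; 16]`.

```rust
/// Convert one IEEE 754 binary16 bit pattern to f32.
/// Hypothetical helper; handles zero, subnormal, normal, infinity, and NaN.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1F) as u32;
    let frac = (h & 0x03FF) as u32;
    let magnitude = match exp {
        // Zero or subnormal: value is frac * 2^-24, exactly representable in f32.
        0 => (frac as f32) * (1.0 / 16_777_216.0),
        // Infinity or NaN: all-ones exponent, fraction bits carried over.
        0x1F => f32::from_bits(0x7F80_0000 | (frac << 13)),
        // Normal: rebias the exponent from 15 to 127 (add 112), widen the fraction.
        _ => f32::from_bits(((exp + 112) << 23) | (frac << 13)),
    };
    // Re-attach the sign bit.
    f32::from_bits(magnitude.to_bits() | sign)
}

/// Assumed shape of a scalar-storage F16x16: sixteen raw binary16 bit patterns.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F16x16Scalar(pub [u16; 16]);

impl F16x16Scalar {
    /// Widen every lane to f32 once, at the boundary of a hot loop.
    pub fn to_f32_array(self) -> [f32; 16] {
        self.0.map(f16_bits_to_f32)
    }
}

/// Dot product that does all arithmetic in fp32 after a single widening step.
pub fn dot_f32(a: F16x16Scalar, b: F16x16Scalar) -> f32 {
    let (a, b) = (a.to_f32_array(), b.to_f32_array());
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```

The point of the pattern is that the per-lane conversion cost is paid once per vector rather than once per arithmetic operation, so the inner loop can run on ordinary f32 lanes.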
AdaWorldAPI pushed a commit that referenced this pull request on May 20, 2026
…ss all backends

PR #178's matrix audit surfaced five 256-bit int lane types that were either entirely missing or stranded in `simd_nightly` only. Adds them across every backend so `crate::simd::{U16x16, U32x8, U64x4, I32x8, I64x4}` resolves uniformly on v3 / v4 / native / nightly / scalar / aarch64 paths.

`src/simd_avx2.rs`

+ 5× `avx2_int_type!` instantiations producing scalar-storage `[$elem; $lanes]` polyfills (align 64). Same macro pattern as the existing 512-bit polyfills (U8x64, U16x32, …). Native AVX2 `__m256i` upgrades are TD-SIMD-3.
+ 5× lowercase aliases (`u16x16 = U16x16`, etc.) matching the std::simd convention used by every other lane type in the file.

`src/simd_scalar.rs`

+ 5× `impl_int_type!` instantiations mirroring the AVX2 polyfills above. Consumers on non-x86/non-aarch64 (wasm32, riscv, thumb) reach the same type names through `crate::simd::*`.
+ Lowercase aliases.

`src/simd_avx512.rs`

+ Re-export of the new types from `simd_avx2` so the v4 dispatch arm in `simd.rs` can surface them without forking the macro into this file. Both files are already gated on `target_arch = "x86_64"`, so the re-export is cheap. Native `__m256i` upgrades here are TD-SIMD-3 (same story as the v3 polyfills).

`src/simd_nightly/u_word_types.rs`

+ `U16x16` wrapper backed by `core::simd::u16x16`. Same API surface as the existing 32-/16-/8-lane wrappers: splat, from_slice, from_array, to_array, copy_to_slice, reduce_{sum,min,max}, simd_min/max, cmpeq_mask, cmpgt_mask, Default.

`src/simd_nightly/i_word_types.rs`

+ `I32x8` and `I64x4` wrappers backed by `core::simd::{i32x8, i64x4}`. Same API surface as siblings; PartialEq via array compare.

`src/simd_nightly/mod.rs`

+ Re-exports for the three new types + lowercase aliases.

`src/simd.rs`

+ All 5 dispatch arms (nightly, v4, v3, aarch64, scalar fallback) updated to surface the new types through `crate::simd::*`.

`.claude/knowledge/simd-dispatch-architecture.md`

+ Parity matrix updated: the five rows previously marked ❌ across most backends now show 🟠 polyfill (v3, v4-via-v3, scalar) / 🔵 (nightly via `core::simd`).

Verified: `cargo check` clean under default v3 features and under `-Ctarget-cpu=x86-64-v4` (via `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUSTFLAGS` + explicit `--target` so build scripts don't SIGILL on non-AVX-512 runners; same pattern as the tier4-avx512-check job).
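The scalar-storage polyfill pattern described for `avx2_int_type!` might look roughly like the sketch below. This is an illustration under assumptions, not the crate's macro: `polyfill_int_type!` is a hypothetical name, its argument order is invented, and only a handful of the listed methods (splat, from_slice, to_array, reduce_sum, simd_max) are shown. The `[$elem; $lanes]` storage, the 64-byte alignment, and the lowercase aliases follow the commit message.

```rust
// Hypothetical stand-in for the `avx2_int_type!` pattern; the real macro's
// name, arguments, and method set may differ.
macro_rules! polyfill_int_type {
    ($name:ident, $alias:ident, $elem:ty, $lanes:expr) => {
        /// Scalar-storage lane type: a plain array, aligned to 64 bytes.
        #[derive(Clone, Copy, Debug, PartialEq)]
        #[repr(C, align(64))]
        pub struct $name(pub [$elem; $lanes]);

        /// Lowercase alias in the std::simd naming style.
        #[allow(non_camel_case_types)]
        pub type $alias = $name;

        impl $name {
            pub const LANES: usize = $lanes;

            /// Broadcast one value to every lane.
            pub fn splat(v: $elem) -> Self {
                Self([v; $lanes])
            }

            /// Load the first LANES elements of a slice (panics if too short).
            pub fn from_slice(s: &[$elem]) -> Self {
                let mut a = [0 as $elem; $lanes];
                a.copy_from_slice(&s[..$lanes]);
                Self(a)
            }

            pub fn to_array(self) -> [$elem; $lanes] {
                self.0
            }

            /// Wrapping horizontal sum over all lanes.
            pub fn reduce_sum(self) -> $elem {
                self.0.iter().fold(0 as $elem, |acc, &x| acc.wrapping_add(x))
            }

            /// Lane-wise maximum.
            pub fn simd_max(self, other: Self) -> Self {
                let mut out = self.0;
                for i in 0..$lanes {
                    out[i] = out[i].max(other.0[i]);
                }
                Self(out)
            }
        }
    };
}

// The five 256-bit integer shapes named in the commit message.
polyfill_int_type!(U16x16, u16x16, u16, 16);
polyfill_int_type!(U32x8, u32x8, u32, 8);
polyfill_int_type!(U64x4, u64x4, u64, 4);
polyfill_int_type!(I32x8, i32x8, i32, 8);
polyfill_int_type!(I64x4, i64x4, i64, 4);
```

Because every backend exposes the same type names and methods, a later swap to native `__m256i` storage (TD-SIMD-3) would not need to change consumer call sites, provided the method signatures stay the same.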
## Summary

TD-SIMD-8 (F16 honesty) + matrix audit for missing lane wrappers (U16/U32/U64, I4/8/16/32/64, F32; user request).

### F16 honesty

- `src/simd_half.rs::F16x16`: docstring now explicitly discloses scalar `[u16; 16]` storage and routes hot loops to `core::simd::f16x16` (under `nightly-simd`) or to fp32 with conversion at boundaries.
- `simd_avx2::F16Scaler`: that's a scaling context for range-normalizing values before f16 encoding, NOT the F16x16 SIMD type. Both files now cross-reference each other.

### Matrix corrections

Cross-referenced every `pub struct *x*` in `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`, `simd_nightly/mod.rs` against the parity matrix. Found these gaps:

- `F32x8` v3: ❌ → ✅ `__m256` (in `simd_avx512`). `src/simd.rs:294` already imports it on v3 path; it's AVX (not AVX-512), works Sandy Bridge+.
- `F64x4` v3: ❌ → ✅ `__m256d` (in `simd_avx512`).
- `U32x8` row added.
- `U64x4` row added.
- `U16x16` row added.
- `I32x8` row added.
- `I64x4` row added.
- `F32Mask8` row added: `F32Mask8Scalar` in `simd_scalar`; not surfaced through `crate::simd::*`.
- `F64Mask4` row added: `F64Mask4Scalar` in `simd_scalar`; not surfaced.

### Sub-byte lanes section added

`I4` / `U4` (4-bit nibbles) used by INT4 quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper exists anywhere: consumers pack 2× nibbles per byte and operate through `U8x64` with `shr_epi16` + `& 0x0F` masks. Documents the hardware story (AVX-512 VBMI2 `VPCOMPRESSB`, VPMADD52 on x86; shr+mask on aarch64). Tracked as TD-SIMD-11 if a consumer files for it.

### TD-SIMD-8 row updated

§5 entry now points at `src/simd_half.rs:123` (the actual F16x16 polyfill) rather than the unrelated `F16Scaler` at `simd_avx2.rs:2566`. Documents the three remediation options: (a) wire `_mm256_cvtph_ps` under `target_feature = "f16c"` (Ivy Bridge+; all AVX-512 hosts), (b) `F16x16Scalar` alias to make scalar nature explicit at consumer call sites, (c) type-level doc-warning. ~80 LoC estimate.

### Test plan

- `cargo check` paths unchanged.
- `cargo fmt --check` clean (no Rust code changed beyond two doc comments).

Generated by Claude Code
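For the sub-byte lanes discussed above, the "pack 2× nibbles per byte, then shift and mask" approach can be sketched in scalar Rust as follows. The function names are hypothetical, and the low-nibble-first byte layout is an assumption of this sketch rather than a statement about Q4_0, Q4_K, GPTQ, or AWQ, which each define their own layouts.

```rust
/// Pack pairs of 4-bit values into bytes, low nibble first.
/// The nibble order is an assumption of this sketch.
pub fn pack_u4(nibbles: &[u8]) -> Vec<u8> {
    assert!(nibbles.len() % 2 == 0, "need an even number of nibbles");
    nibbles
        .chunks_exact(2)
        .map(|pair| (pair[0] & 0x0F) | ((pair[1] & 0x0F) << 4))
        .collect()
}

/// Unpack bytes into 4-bit values using the shift + mask trick.
pub fn unpack_u4(packed: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        out.push(byte & 0x0F); // low nibble: mask only
        out.push((byte >> 4) & 0x0F); // high nibble: shift, then mask
    }
    out
}

/// Sign-extend a 4-bit two's-complement value to i8 (for I4 lanes).
pub fn sign_extend_i4(nibble: u8) -> i8 {
    // Move the nibble into the top half of the byte, then let the
    // arithmetic right shift on i8 replicate the sign bit.
    (((nibble & 0x0F) << 4) as i8) >> 4
}
```

A vectorized version would apply the same mask and shift to all 64 bytes of a `U8x64` at once; the scalar form is shown only because it makes the bit manipulation easy to check by hand.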